perm filename TTT[4,KMC]2 blob
sn#024612 filedate 1973-02-12 generic text, type T, neo UTF8
HOW TO USE AND HOW NOT TO USE TURING-LIKE TESTS
IN EVALUATING THE ADEQUACY OF SIMULATION MODELS

KENNETH MARK COLBY
AND
FRANKLIN DENNIS HILF

     It is very easy to become confused about Turing's Test. In
part this is due to Turing himself, who introduced the now-famous
imitation game in a 1950 paper entitled COMPUTING MACHINERY AND
INTELLIGENCE [3]. A careful reading of this paper reveals that there
are actually two games proposed, the second of which is commonly
called Turing's test.
     In the first imitation game two groups of judges try to
determine which of two interviewees is a woman. Communication between
judge and interviewee is by teletype. Each judge is initially
informed that one of the interviewees is a woman and one a man who
will pretend to be a woman. After the interview, the judge is asked
what we shall call the woman-question, i.e., which interviewee was
the woman? Turing does not say what else the judge is told, but one
assumes the judge is NOT told that a computer is involved, nor is he
asked to determine which interviewee is human and which is the
computer. Thus, the first group of judges would interview two
interviewees: a woman, and a man pretending to be a woman.
     The second group of judges would be given the same initial
instructions, but unbeknownst to them, the two interviewees would be
a woman and a computer programmed to imitate a woman. Both groups
of judges play this game until sufficient statistical data are
collected to show how often the right identification is made. The
crucial question then is: do the judges decide wrongly AS OFTEN when
the game is played with man and woman as when it is played with a
computer substituted for the man? If so, then the program is
considered to have succeeded in imitating a woman as well as a man
imitating a woman. For emphasis we repeat: in asking the
woman-question in this game, judges are not required to identify
which interviewee is human and which is machine.
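     The decision rule in the first game is statistical: the program
succeeds if the second group of judges errs about as often as the
first. Turing reports no counts, so the comparison below uses invented
figures purely for illustration; one way to sketch it is a standard
two-proportion z-test on the misidentification rates of the two judge
groups:

```python
import math

def two_proportion_z(wrong1, n1, wrong2, n2):
    """Two-sided z-test: do two judge groups err at different rates?"""
    p1, p2 = wrong1 / n1, wrong2 / n2
    p = (wrong1 + wrong2) / (n1 + n2)                # pooled error rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

# Hypothetical counts: 18 of 50 judges wrong with a man imitating a
# woman, 21 of 50 wrong with the computer substituted for the man.
z, p = two_proportion_z(18, 50, 21, 50)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value here would mean the two error rates are statistically
indistinguishable, which is exactly the outcome the program is hoping
for in the first game.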
     Later on in his paper Turing proposes a variation of the
first game. In the second game one interviewee is a man and one is a
computer. The judge is asked to determine which is the man and which
is the machine; we shall call this the machine-question. It is this
version of the game which is commonly thought of as Turing's test. It
has often been suggested as a means of validating computer
simulations of psychological processes.
     In the course of testing a simulation (PARRY) of paranoid
linguistic behavior in a psychiatric interview, we conducted a number
of Turing-like indistinguishability tests [1]. We say
`Turing-like' because none of them consisted of playing the two games
described above. We chose not to play these games for a number of
reasons, which can be summarized by saying that they do not meet
modern criteria for good experimental design. In designing our tests
we were primarily interested in learning more about developing the
model. We did not believe the simple machine-question to be a useful
one in serving the purpose of progressively increasing the
credibility of the model, but we investigated a variation of it to
satisfy the curiosity of colleagues in artificial intelligence.
     In this design eight psychiatrists interviewed by teletype
two patients, one being PARRY and one being an actual
hospitalized paranoid patient. The interviewers were not informed
that a simulation was involved, nor were they asked to identify which
was the machine. Their task was to conduct a diagnostic psychiatric
interview and rate each response from the `patients' along a 0-9
scale of paranoidness, 0 meaning no paranoidness at all and 9 the
highest degree. Transcripts of these interviews, without the ratings
of the interviewers, were then utilized for various experiments in
which randomly selected expert judges conducted evaluations of the
interview transcripts. For example, in one experiment it was found
that patients and model were indistinguishable along the dimension of
paranoidness.
     To ask the machine-question, we sent interview transcripts,
one with a patient and one with PARRY, to 100 psychiatrists
randomly selected from the Directory of American Specialists and the
Directory of the American Psychiatric Association. Of the 41 replies,
21 (51%) made the correct identification while 20 (49%) were wrong.
Based on this random sample of 41 psychiatrists we are 95% confident
that between 35.9% and 66.5% of all psychiatrists could make the
correct identification, a figure which is close to a chance level.
(Our statistical consultant was Dr. Helena C. Kraemer, Research
Associate in Biostatistics, Department of Psychiatry, Stanford
University.)
     Psychiatrists are expert judges of patient interview behavior,
but they are unfamiliar with computers. Hence we conducted the same
test with 100 computer scientists randomly selected from the
membership list of the Association for Computing Machinery (ACM). Of
the 67 replies, 32 (48%) were right and 35 (52%) were wrong. Based on
this random sample of 67 computer scientists we are 95% confident
that between 36% and 60% of all computer scientists could make the
correct identification, a range close to that expected by chance.
     Thus the answer to this machine-question, "Can expert judges,
psychiatrists and computer scientists, using teletyped transcripts
of psychiatric interviews, distinguish between paranoid patients and
a simulation of paranoid processes?", is "no". But what do we learn
from this? It is some comfort that the answer was not "yes" and the
null hypothesis (no differences) failed to be rejected, especially
since statistical tests are somewhat biased in favor of rejecting the
null hypothesis [2]. Yet this answer does not tell us what we would
most like to know, i.e., how to improve the model. Simulation models
do not spring forth in a complete, perfect and final form; they must
be gradually developed over time. Perhaps we might obtain a "yes"
answer to the machine-question if we allowed a large number of expert
judges to conduct the interviews themselves rather than studying
transcripts of other interviewers. It would indicate that the
model must be improved, but unless we systematically investigated how
the judges succeeded in making the discrimination we would not know
what aspects of the model to work on. The logistics of such a design
are immense, and obtaining a large N of judges for sound statistical
inference would require an effort disproportionate to the
information-yield.
     A more efficient and informative way to use Turing-like tests
is to ask judges to make ordinal ratings along scaled dimensions from
teletyped interviews. We shall term this approach asking the
dimension-question. One can then compare scaled ratings received by
the patients and by the model to determine precisely where and by how
much they differ. Model builders strive for a model which
shows indistinguishability along some dimensions and
distinguishability along others. That is, the model converges on what
it is supposed to simulate and diverges from that which it is not.
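     Asking the dimension-question amounts to comparing ratings
dimension by dimension. The sketch below uses invented judge ratings
(the numbers carry no empirical weight) to show the shape of such a
comparison:

```python
# Invented 0-9 ratings on three dimensions, several judges each;
# the tuple holds (patient ratings, model ratings).
ratings = {
    "linguistic noncomprehension": ([1, 2, 1, 2], [5, 6, 4, 5]),
    "fear":                        ([4, 5, 4, 3], [4, 4, 5, 3]),
    "delusions":                   ([7, 8, 6, 7], [4, 5, 4, 5]),
}

for dim, (patient, model) in ratings.items():
    # positive difference: model rated higher; negative: patients higher
    d = sum(model) / len(model) - sum(patient) / len(patient)
    verdict = "model higher" if d > 0 else "patients higher" if d < 0 else "equal"
    print(f"{dim:28s} mean difference {d:+.2f} ({verdict})")
```

The point of the exercise is that each dimension yields a signed,
sized difference, rather than the single yes/no verdict of the
machine-question.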
     We mailed paired-interview transcripts to another 400
randomly selected psychiatrists, asking them to rate the responses of
the two `patients' along certain dimensions. The judges were divided
into groups, each judge being asked to rate responses of each I-O
pair in the interviews along four dimensions. The total number of
dimensions in this test was twelve: linguistic noncomprehension,
thought disorder, organic brain syndrome, bizarreness, anger, fear,
ideas of reference, delusions, mistrust, depression, suspiciousness
and mania. These are dimensions which psychiatrists commonly use in
evaluating patients.
     Table 1 shows there were significant differences, with PARRY
receiving higher scores along the dimensions of linguistic
noncomprehension, thought disorder, bizarreness, anger, mistrust and
suspiciousness. On the dimension of delusions the patients were rated
significantly higher. There were no significant differences along the
dimensions of organic brain syndrome, fear, ideas of reference,
depression and mania.
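     The paper does not state which significance test produced the
Table 1 results. As one illustration of how a per-dimension difference
can be tested without distributional assumptions, here is a two-sided
permutation test on invented ratings:

```python
import random

def perm_test(patient, model, trials=20000, seed=0):
    """Estimate a two-sided p-value for the difference of mean ratings
    by randomly relabelling the pooled ratings many times."""
    rng = random.Random(seed)
    pooled = patient + model          # a copy; the originals are untouched
    n = len(patient)
    observed = abs(sum(model) / len(model) - sum(patient) / n)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)           # random relabelling of the ratings
        a, b = pooled[:n], pooled[n:]
        if abs(sum(b) / len(b) - sum(a) / n) >= observed:
            hits += 1
    return hits / trials              # fraction of relabellings as extreme

# Invented 0-9 judge ratings on `linguistic noncomprehension'
patients = [1, 2, 1, 0, 2, 1, 2, 1]
parry    = [5, 4, 6, 5, 3, 5, 4, 6]
print(f"estimated p-value: {perm_test(patients, parry):.4f}")
```

A small p-value on a given dimension corresponds to a "significant
difference" entry of the kind reported in Table 1.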
     While tests asking the machine-question indicate
indistinguishability at the gross level, a study of the finer
structure of the model's behavior through ratings along scaled
dimensions shows statistically significant differences between
patients and model. These differences help the model builder by
suggesting which aspects of the model must be modified and improved
in order for it to be considered an adequate simulation of the class
of paranoid patients it is intended to simulate. For example, it is
clear that PARRY's language comprehension must be improved. Once this
has been implemented, a future test will tell us whether improvement
has occurred, and by how much in comparison to the earlier version.
Successive identification of particular areas of failure in the model
permits their improvement and the development of more adequate
model-versions.
     Further evidence that the machine-question is too coarse and
insensitive a test comes from the following experiment. In this test
we constructed a random version of the paranoid model which utilized
PARRY's output statements but expressed them randomly, no matter what
the interviewer said. Two psychiatrists conducted interviews with
this model, transcripts of which were paired with patient interviews
and sent to 200 randomly selected psychiatrists, asking both the
machine-question and the dimension-question. Of the 69 replies, 34
(49%) were right and 35 (51%) wrong. Based on this random sample of
69 psychiatrists we are 95% confident that between 39% and 63% of all
psychiatrists would make the correct identification, again indicating
a chance level. However, as shown in Table 2, significant differences
appear along the dimensions of linguistic noncomprehension, thought
disorder and bizarreness, with RANDOM-PARRY rated higher. On these
particular dimensions we can construct a continuum in which the
random version represents one extreme and the actual patients the
other. Our (nonrandom) PARRY lies somewhere between these two
extremes, indicating that it performs significantly better than the
random version but still requires improvement before becoming
indistinguishable from patients (see Fig. 1). Hence this approach
provides yardsticks for measuring the adequacy of this or any other
dialogue simulation model along the relevant dimensions.
     We conclude that when model builders want to conduct tests
which indicate in which direction progress lies, and to obtain a
measure of whether progress is being achieved, the way to use
Turing-like tests is to ask expert judges to make ratings along
multiple dimensions considered essential to the model. Useful tests
do not prove a model; they probe it for its sensitivities. Simply
asking the machine-question yields no information relevant to
improving what the model builder knows is only a first approximation.

REFERENCES

[1] Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C. Turing-like
indistinguishability tests for the validation of a computer
simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3,
(1972), 199-221.

[2] Meehl, P.E. Theory testing in psychology and physics: a
methodological paradox. PHILOSOPHY OF SCIENCE, 34, (1967), 103-115.

[3] Turing, A. Computing machinery and intelligence. Reprinted in:
COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
McGraw-Hill, New York, 1963, pp. 11-35.